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Background properties in experimental particle physics are typically estimated from large collections 
of events. This usually provides precise knowledge of average background distributions, but inevitably 
hides fluctuations. To overcome this limitation, an approach based on statistical mixture model decom- 
position is presented. Events are treated as heterogeneous populations comprising particles originating 
from different processes, and individual particles are mapped to a process of interest on a probabilis- 
tic basis. When used to discriminate against background, the proposed technique based on the Gibbs 
sampler allows features of the background distributions to be estimated directly from the data without 
training on high-statistics control samples. A feasibility study on Monte Carlo is presented, together 
with a comparison with existing techniques. Finally, the prospects for the development of the Gibbs 
sampler into a tool for intensive offline analysis of interesting events at the Large Hadron Collider are 
discussed. 
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1 Introduction 

Background discrimination in particle physics is 
usually performed by identifing events that are 
more likely to contain a physics process of interest, 
the primary goal being rejection of contributions 
from uninteresting processes that mimic the sig- 
nal and thus make its extraction and measurement 
more complicated. Traditional approaches achieve 
this goal by focussing on entire events, compar- 
ing kinematic and topological properties with ref- 
erence distributions usually obtained from control 
samples. 

This article presents a novel approach that 



builds on a population-based view of particle 
physics events, which are treated as collections 
of particles originating from different physics pro- 
cesses such as a hard scattering of interest as op- 
posed to background. The main goal is not to iden- 
tify events that are more likely to contain a given 
signal process, but rather to process collections of 
particles and map each of them to different pro- 
cesses on a probabilistic basis. 

This is achieved by applying mixture model 
decomposition techniques [lj, which are well es- 
tablished in statistics and which have been used 
in other disciplines to solve formally-similar prob- 
lems. In this formulation, events are treated as het- 
erogeneous statistical populations comprising par- 
ticles whose kinematics and topology reflect the 
process they originated from: an input collection of 
particles is processed using an iterative algorithm 
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that assigns individual particles a probability for 
them to be associated with a hard scattering of 
interest as opposed to background. 

This contribution describes a feasibility study 
for the use of the Gibbs sampler for back- 
ground discrimination at the Large Hadron Col- 
lider (LHC). 

The Gibbs sampler was originally proposed 
with regards to Bayesian restoration of images 
[2] and has subsequently been used to solve dif- 
ferent problems in a number of disciplines. The 
Gibbs sampler and, more generally, Markov Chain 
Monte Carlo (MCMC) techniques have also been 
recently applied to the study of the Cosmic Mi- 
crowave Background radiation [3]. As far as par- 
ticle physics is concerned, the last few years have 
witnessed a renewed interest in Bayesian numeri- 
cal methods in the context of a general effort to 
take advantage of the flexibility of Bayesian statis- 
tics for data analysis [1] [5] [6], in addition to the 
use of MCMC methods with reference to specific 
optimization problems [7]. 

In the context of this study, the Gibbs sampler 
iterates on an input collection of particles in order 
to decompose the corresponding mixture into par- 
ticles originating from a hard scattering of interest 
as opposed to background, and individual particles 
are mapped to signal or background on a prob- 
abilistic basis. Results obtained using the Gibbs 
sampler on a collection of ~ 600 simulated particles 
originating from a hard scattering and from back- 
ground are reported in this article, together with 
results from additional toy Monte Carlo studies. 

Although the proposed approach can be seen 
as a solution to a classification problem where in- 
dividual particles are considered as opposed to en- 
tire events, the technique presented here differs sig- 
nificantly from the simple application of existing 
classifiers to individual particles. Traditional clas- 
sification techniques in particle physics in fact rely 
on supervised algorithms, which are first trained 
on high-statistics control samples, on which they 
"learn" distinctive features of signal and back- 
ground signatures, and which are subsequently ap- 



plied to a test data set. This assumes that back- 
ground properties estimated from a high-statistics 
training sample also reliably describe background 
activity in the data set under study, which often 
comprises a significantly-lower number of particles 
than the control sample that was used to train the 
classifier. 

However, the above assumption is often wrong: 
different events in particle physics can in fact look 
very different from one another even when the un- 
derlying physics processes are the same, and sta- 
tistical fluctuations can be non-negligible in low- 
statistics data sets. When classification is per- 
formed using traditional supervised algorithms, 
fluctuations are usually not taken into account, 
since training typically relies on high-statistics con- 
trol samples. The Gibbs sampler as presented in 
this article can instead also be operated in unsu- 
pervised mode: this allows the algorithm to es- 
timate signal and background probability density 
functions (PDFs) directly from the data, and to 
map individual particles to signal or background 
based on those estimates instead of predefined 
high-statistics templates. 

From a broader perspective, this contribution 
describes a set of techniques that exploit the flexi- 
bility of the Gibbs sampler to estimate signal and 
background properties directly from the data, and 
to assign individual particles a probability for them 
to originate from either process. This article does 
not present an established method for background 
discrimination at the LHC, but rather discusses re- 
sults of a feasibility study for the use of the Gibbs 
sampler in different configurations, the primary ob- 
jective being the decomposition of an input col- 
lection of particles into signal and background- 
associated subpopulations. While the algorithm 
is not here proposed tool for intensive of- 

fline analysis of individual interesting events at the 
LHC, the total number of particles chosen for this 
proof of concept is in line with typical charged par- 
ticle multiplicities corresponding to LHC operat- 
ing conditions as of July 2011. For this reason, 
the present implementation of the algorithm is a 
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promising starting point with a view to develop- 
ing techniques based on the Gibbs sampler for the 
analysis of individual LHC events at the particle 
level. 

2 The algorithm 

This novel algorithm for background discrimina- 
tion is presented with reference to the general prob- 
lem of decomposing a collection of particles from 
high-energy particle collisions into subpopulations 
associated with different underlying physics pro- 
cesses and described in terms of different probabil- 
ity distributions. 

The input data set thus consists of a mixture 
of particles, some of which originated from a hard 
scattering of interest, others from a background 
process. Provided that the corresponding subpop- 
ulations can be characterized sufficiently well in 
terms of their kinematic or topological properties, 
it is in principle possible to ask, for each particle, 
what the probability is for it to originate from sig- 
nal as opposed to background. In particular, the 
Gibbs sampler can estimate such probabilities by 
iteratively sampling from the subpopulation PDFs. 

Although mixture model decomposition tech- 
niques based on the Gibbs sampler are well- 
established in statistics, classical algorithms are 
not suited for a direct application to the problem 
at hand without modification. Classical mixture 
models in fact usually rely on a parametric for- 
mulation that requires the shapes of the PDFs of 
the relevant subpopulations to be known a priori. 
However, our previous studies suggest that this 
assumption may be responsible for non-negligible 
bias when the Gibbs sampler is applied to the prob- 
lem at hand. The latter bias ultimately relates 
to the assumption that PDF functional forms ob- 
tained from high-statistics control samples are also 
appropriate to describe the corresponding proba- 
bility distributions in the input data set, which of- 
ten comprises a significantly-lower number of par- 
ticles, and in which statistical fluctuations may not 
be negligible. 



For this reason, the statistical model for the 
Gibbs sampler is here formulated in more general 
terms with respect to classical mixture models used 
in statistics, without requiring knowledge of the 
PDF functional forms of the relevant subpopula- 
tions. The model is based on the following PDF: 

k 
3=1 

with subpopulation fractions ay ("mixture 
weights") satisfying: 

k 

i=i 

The variable x can correspond to particle pseu- 
dorapidity r], a kinematic variable related to the 
polar angle of the particle, or pt i.e. the trans- 
verse momentum of the particle with respect to 
the beam direction. 

The subpopulation PDFs fj are defined in 
terms of regularized histograms of described 
in section O The symbol fj is used to denote 
the estimate of the generic subpopulation PDF fj 
throughout the text. 

Given the above mixture of probability distri- 
butions and given a set of observations {a?i}i=i ... jVj 
the problem of clustering the observations into 
k groups by probabilistically associating each of 
them with a distribution of origin has been solved 
numerically in a Bayesian framework using MCMC 
techniques, in particular using the Gibbs sampler 

The pseudocode of the algorithm is reported 
below. The value of variable v at iteration t is 
indicated with throughout. 

1. Initialization: Choose = {a^}j and 
fj —<Pj , j = l,...,k as described in section 

El 

2. Iteration t: 

(a) Generate the allocation variables zf- , 
i = 1, N, j = 1, k based on prob- 
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abilities P(zf) = l\o$- X) M^^i) 

proportional to a? ^ f(%i\<Pj ^). The 

quantity zf^ equals 1 when observation 
i is mapped to distribution j at iteration 
t, and otherwise. In general, the vari- 
ables zf- depend both on the mixture 
weights etj and on the estimates cpj of 
the subpopulation PDFs from the pre- 
vious iteration. 

(b) Generate a® from the probability den- 
sity function of a given z( t_1 ) = 

{ z ij ^K?' P(—\— Knowledge of 

which particles are mapped to process 
j at iteration t—1 makes it possible to 
generate the subpopulation fractions a 
at iteration t. 

(c) Obtain an updated estimate of the sub- 
population PDFs from the data x based 
on knowledge of which particles are 
mapped to subpopulation j at iteration 
t — 1. Details are provided in section [3j 

A specific choice for the function p and a way to ob- 
tain updated estimates of the subpopulation PDFs 
fj are described in section [3] with reference to a 
Monte Carlo study. 

The central idea of the algorithm is the follow- 
ing: the better the observations {xi}i are mapped 
to the subpopulations j = 1, ...,k, the more accu- 
rate the estimates of p(a\z) and of the subpopula- 
tion PDFs (fj. Once some correct values of Z{ j are 
found, p(a\z) and (fj begin to roughly reflect the 
correct distributions, which in turn leads to addi- 
tional correct mappings z%j to be found in subse- 
quent iterations. 

The Gibbs sampler can be operated both in su- 
pervised and unsupervised mode. The above pseu- 
docode corresponds to the "unsupervised Gibbs 
sampler", where updated estimates of subpopula- 
tion PDFs are obtained at each iteration, as indi- 
cated at step (c). On the other hand, when step 
(c) is removed from the pseudocode, particles are 
mapped to signal and background based on the 



subpopulation PDFs provided at initialization, and 
the algorihtm is operated in supervised mode ("su- 
pervised Gibbs sampler"). The primary objective 
of this contribution is to propose the use of the 
Gibbs sampler in different configurations in order 
to: 

• Obtain estimates <fj of the subpopulation 
PDFs from the input data set. 

• Estimate the subpopulation weights ay. In 
the context of this study, this corresponds to 
estimating the fractions of background and 
signal particles contained in the input data 
set. 

• Assign individual particles a probability for 
them to originate from a given process based 
on the subpopulation PDFs estimated at step 
(a) as opposed to relying on high-statistics 
templates. In the context of this study, this 
allows classification of individual particles 
into signal and background. 

This article reports the results of a feasibility 
study for one possible combination of the super- 
vised and unsupervised Gibbs sampler that was 
chosen with a view to minimizing systematic un- 
certainties. However, other modes of operation of 
the algorithm may be appropriate for specific ap- 
plications, and the exact configuration of the Gibbs 
sampler may have to be tailored to the analysis at 
hand. 

3 Monte Carlo study 

The Gibbs sampler was applied to a Monte Carlo 
data set generated using Pythia 8.140 |8j [9j, ob- 
tained superimposing gg — > ti signal events from 
pp interactions at y/s = 14 TeV to soft QCD inter- 
actions, so called Minimum Bias events, in order 
to simulate background. The signal process was 
chosen in order to illustrate the use of the algo- 
rithm for background discrimination at the parti- 
cle level. Further studies will be needed in order 
to extend these results beyond the proof of concept 
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(a) (b) 



Figure 1: Generator-level rj (top left) and pt (bottom left) distributions for signal (solid green histograms) 
and background particles (dashed red histograms) with 2 GeV/c < pr < 5 GeV/c from the high-statistics 
control sample. The distributions correspond to a total number of ~ 33, 000 particles and are normalized 
to unit area. The corresponding two-dimensional distribution is displayed on the right-hand side of the 
figure. 



presented in this article, and to assess the poten- 
tial of the Gibbs sampler for background discrim- 
ination in the context of specific analyses at the 
LHC. 

The Gibbs sampler was run over a collection of 
charged particles with 2 GeV/c < px < 5 GeV/c, 
and individual particles were assigned a probabil- 
ity for them to originate from signal as opposed to 
background based on their r\ and pt values. 

The pseudocode of the algorithm used for this 
application is shown below. Subscripts sig and 
relate to signal and background, respectively. 



(0) 



1. Initialization: Set a^kg = = 0.5, 

fj = (Pj • Initial conditions for the esti 
mates d 



of the subpopulation PDFs /, 



(0) 



are given by regularized rj and pt distribu- 
tions from the high-statistics control sample, 
as described in section [3~T1 

2. Iteration t: 



(a) Generate zf- for all particles (i = 
1,...,N) and distributions (j = 
1,2 corresponding to background 
and signal, respectively) according 



,(*) 



to /V-, ; 



. | (i-i) (t-l) 
II a.- \(fj ' ,Xi) oc 



3 
Oi2 



1 - 
.,(*) 



3 

Oibkg- 



where a\ 



1} / N > v i- This cor - 



(b) Set af .„ 

responds to the simplest choice of set- 
ting piajlzW) = 5( aj - z^/N) 
for the probability density function of a 



given z. 



(c) Obtain updated estimates of the sub- 
population PDFs by regularizing the r\ 
and pt distributions corresponding to 
particles mapped to the relevant sub- 
population at iteration t—1, i.e. based 
.(t-l) 



on z. 



■'3 



In general, the functions fj are the joint PDFs 
for r\ and px corresponding to background (j = 1) 
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and signal (j = 2) particles. This study is re- 
stricted to charged particles with 2 GeV/c < px < 
5 GeV/c, which makes it possible to neglect the 
correlation between r\ and pr as a first approxi- 
mation. For this reason, the joint PDFs take the 
form f sig/bkg = f^ g/bkg f^ bkg , and obtaining up- 
dated estimates of the subpopulation PDFs reduces 
to regularization of one-dimensional histograms, as 
described in the following. 

As for the number of iterations to be used with 
the Gibbs sampler, no rule is documented in the 
literature, and the choice is generally problem- 
dependent. The number of iterations was set to 
1,000 in this study, and probabilities were aver- 
aged over the last 100 iterations. Multiple runs 
were performed under different conditions and us- 
ing different numbers of iterations to make sure the 
algorithm converged. The issue of whether the sta- 
tionary distribution has been reached by the sam- 
pler has no clear solution in the literature, and 
establishedpractice relies on visual inspection of 
trace plots q 

In order to obtain initial conditions tp^ for the 
subpopulation PDFs, a Monte Carlo data set was 
used containing a total of about 33,000 charged 
particles in the kinematic range 2 GeV/c < px < 



,(°) 



this 



5 GeV/c. In addition to estimation of yr- 
high-statistics control sample was also used to 
guide the histogram regularization procedure as 
described in the following. Figure Q] (a) shows 
the rj and px distributions for signal (solid green 
histograms) and background particles (dashed red 
histograms) . 

As anticipated, one of the goals of the Gibbs 
sampler is to estimate the background PDFs di- 
rectly from the input collection of particles. In 
other words, the algorithm can classify particles 
into signal and background without using any pre- 
defined background templates: the background 
PDFs that are estimated by the algorithm thus re- 



flect the specific background conditions in the in- 
put data set, which can be different from "average" 
conditions estimated from a high-statistics control 
sample. 

The Gibbs sampler basically tries to uncover a 
signal and a background subpopulation in the in- 
put collection of particles based on the data and 
on initial conditions for the subpopulation PDFs. 
The results presented in this article relate to an 
input data set comprising 636 charged particles in 
the above-mentioned kinematic region 2 GeV/c < 
Pt < 5 GeV/c, out of which 481 originate from a 
signal hard process and 155 from background, cor- 
responding to a fraction of background particles of 
~ 24%. The total number of particles in the in- 
put data set is in line with typical charged particle 
multiplicities at the LHC as of July 2011. 

The signal and background T] and pt distribu- 
tions corresponding to the Monte Carlo data set 
used in this study are shown in figure [2l The 
solid green (dashed red) histograms in the upper 
panel display the signal (background) rj distribu- 
tions, normalized to unit area. The corresponding 
Pt distributions are given in the lower panel. 

Figure [3] shows an example of the x = rj distri- 
bution for particles mapped to the signal subpopu- 
lation at a given step of the Gibbs sampler. As op- 
posed to assuming a functional form for the PDF 
and fitting a function to the histogram, the his- 
togram is regularized i.e. the subpopulation PDF 
is obtained by means of spline interpolation of the 
histogram contents, as further discussed in the fol- 
lowing. The superimposed curve on the figure cor- 
responds to the regularized histogram, and is used 
by the Gibbs sampler as an estimate of the corre- 
sponding subpopulation PDF fj. 

This approach gives the algorithm more flex- 
ibility in terms of estimating the subpopulation 
PDFs directly from the input data set, and results 
in a significant reduction of the biases observed in 
our previous studies. 



1 In general, convergence of a MCMC time series can be linked to the number of draws from the sequence and to the positive 
auto-correlation between adjacent members |10j [11] , In particular, the standard deviation of the sequence with respect to 
the stationaty value scales as the inverse square root of the number of draws, with a multiplicative factor accounting for 
auto-correlation. 
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Figure 2: Particle r\ and pt distributions from the Monte Carlo input data set used in this study. Solid 
green and dashed red histograms correspond to signal and background, respectively. Distributions are 
normalized to unit area. 



■lllll.ll. 
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Figure 3: Example of the pseudorapidity rj distribution of particles mapped to the background subpop- 
ulation at a given step of the Gibbs sampler. The superimposed curve is the result of the regularization 
procedure described in the text, and is used by the Gibbs sampler as an estimate of the corresponding 
subpopulation PDF. 
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3 . 1 Regular izat ion 

Step (c) in the pseudocode shown in section [3] re- 
quires iterative PDF updates based on the current 
mapping of individual particles to the different sub- 
populations. This operation is performed when the 
Gibbs sampler is operated in unsupervised mode, 
and generally has a significant impact on the re- 
sults, as described below. 

In general, a histogram representing the 7] or 
Pt distribution of particles mapped to signal or 
background at a given iteration cannot be used 
directly as an estimate of the underlying PDF 
due to the presence of statistical fluctuations. In 
fact, a simple spline interpolation of the histogram 
contents, e.g. by means of cubic interpolation, 
generally leads to fluctuations being mistaken for 
genuine features of the underying PDFs, and the 
Gibbs sampler is typically unable to converge. This 
is similar to what happens when observed distri- 
butions in particle physics are unfolded in order 
to "remove" detector effects: regularization tech- 
niques have been investigated in detail in that con- 
text, for instance as a way to use a priori informa- 
tion about the underlying distributions in order to 
get rid of spurious oscillatory components (see e.g. 

inn- 

The complexity of the histogram regularization 
procedure that was used in this study was inten- 
tionally kept minimal in order to avoid the intro- 
duction of additional complications that may ob- 
scure the response of the algorithm at this stage 
of its development. The possible incorporation of 
more developed regularization methods is left for 
future studies. 

A priori information about the signal and back- 
ground PDFs was obtained from the high-statistics 
control sample. When subpopulation PDFs are es- 
timated during the execution of the Gibbs sampler 
in unsupervised mode, a "regularization window" 
is applied to the r] histograms in order to get rid of 
outliers: in other words, a spline interpolation of 
the histogram contents is obtained using only the 
part of the histogram that lies between a minimum 
and a maximum rj value, which leads to extreme 



statistical fluctuations on the tails of the distri- 
bution to be excluded. It is worth noticing that 
assuming a functional form for the PDF and fit- 
ting it to the histogram would effectively produce 
a similar result, i.e. would reduce the impact of 
outliers on the estimated PDF. However, as antic- 
ipated, that approach was observed to introduce 
significant bias in previous studies, and was thus 
abandoned in favour of the non-parametric statis- 
tical model presented in ([1]), where subpopulation 
PDFs are defined as the output of a histogram reg- 
ularization procedure without reference to any pre- 
defined functional form. Figure [4] shows signal and 
background r/ distributions from the high-statistics 
control sample, with superimposed arrows indicat- 
ing the "regularization window". The maximum 
\r]\ value was set to \rj\ = 5 (|?7| = 7) for signal 
(background) in this study. 

On top of this, again with a view to getting 
rid of extreme statistical fluctuations when regu- 
larizing histograms, boundary conditions were in- 
troduced on the T] and p? PDFs, constraining the 
value of fj to points chosen based on control sam- 
ple distributions: in particular, the signal (back- 
ground) rj PDF was constrained to when \rj\ > 5 
(|r/| > 7), and signal (background) px PDFs were 
constrained at 2 GeV/c and 5 GeV/c to 0.7 (1.2) 
and 0.1 (0) (see figure DJa)). 

Results were found to be stable with respect to 
reasonable changes to the regularization contraints 
reported above. 

3.2 Gibbs sampler configuration 

As anticipated, the Gibbs sampler can be operated 
both in supervised and unsupervised mode, and 
this article presents results obtained using a par- 
ticular configuration of the algorithm, correspond- 
ing to a combination of the two modes of operation 
chosen to minimize systematic uncertainties. 

As already pointed out in section [2l the Gibbs 
sampler may be used to process an input collection 
of particles in order to obtain one or more of the 
following results: 
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Figure 4: Illustration of the "regularization window" used in this study for histogram regularization. The 
histograms correspond to signal (solid green) and background (dashed red) rj distributions from the high- 
statistics control sample. The position of the solid green and dashed red arrows correspond to the "regular- 
ization window": at each iteration of the Gibbs sampler, only the part of the distribution that lies between 
the arrows is used for spline interpolation, which makes the results robust against outliers. Additional 
details are given in the text. 



1. Estimate the subpopulation PDFs from the 
input data set. 

2. Estimate the fraction of particles associated 
with a given process present in the input data 
set, e.g. the fraction of background particles. 

3. Assign individual particles a probability for 
them to originate from a given process, such 
as a hard scattering of interest as opposed to 
background, based on subpopulation PDFs 
estimated at step (1) as opposed to relying 
on predefined templates. 

Depending on the objective, it may be appro- 
priate to run the algorithm in different modes. 

For instance, the histogram regularization pro- 
cedure that is used to obtain iterative estimates 
of the subpopulation PDFs with the unsupervised 
Gibbs sampler directly leads to a bias on the mix- 
ture weights. For this reason, the unsupervised 
Gibbs sampler may not be the most appropriate 
tool to estimate the fraction of particles in the in- 
put data set originating from a given process, such 



as the fraction of background particles. On the 
other hand, when the algorithm is operated in un- 
supervised mode, it can obtain data-driven esti- 
mates of the subpopulation PDFs, a result that 
cannot be achieved using traditional supervised 
techniques. 

One possible combination of supervised and un- 
supervised Gibbs sampler is here proposed where: 

1. The supervised Gibbs sampler is first used 
to estimate the mixture weights. In the 
two-subpopulation scenario described in this 
study, goal (2) above corresponds to esti- 
mating the fraction of background particles 
contained in the input data set. The initial 
conditions correspond to a fraction of back- 
ground particles of 0.5, and the subpopula- 
tion PDFs at this stage are fixed to the esti- 
mates provided by the high-statistics control 
sample. The corresponding results are pre- 
sented and discussed in section [3.2.11 

2. The Gibbs sampler is run again on the input 
data set. This second run is performed in 
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unsupervised mode, i.e. subpopulation PDFs 
are now updated at each iteration, starting 
from initial conditions corresponding to reg- 
ularized distributions from the high-statistics 
control sample. On the other hand, the mix- 
ture weights are kept fixed to the results from 
the previous step. 

As for assigning individual particles in the in- 
put data set a probability for them to originate 
from signal as opposed to background, the most 
appropriate approach may again depend on the 
specific application. In general, probabilities may 
be assigned directly using the unsupervised Gibbs 
sampler at step (2) above, as done in this study, or 
an additional run of the supervised Gibbs sampler 
may altenatively be added after the previous two, 
with fixed PDFs given by the estimates from step 
(2). Further studies will be necessary in order to 
better understand the classification performance of 
the Gibbs sampler under different conditions and 
to guide this choice. 

Results obtained running the supervised and 
unsupervised Gibbs sampler as described above on 
the Monte Carlo data set used in this study are 
reported and discussed in the following sections. 

3.2.1 Supervised Gibbs sampler 

The supervised Gibbs sampler was primarily used 
in this study to estimate the mixture weights of 
the statistical model, i.e. the fraction of back- 
ground particles contained in the input data set. 
Figure [5] shows the corresponding estimates over 
the last 100 iterations. The solid green and dashed 
red curves correspond to the estimated fractions of 
signal and background particles, respectively. The 
solid green (dashed red) horizontal line indicates 
the signal (background) true value from the simu- 
lation, while the dash-dot line corresponds to the 
initial conditions for the mixture weights. 

These results indicate a bias on the estimated 
mixture weights. On the other hand, using the 
algorithm in supervised mode to estimate the sub- 
population mixture weights would introduce a de- 



pendence on the regularization parameters due to 
the histogram regularization procedure discussed 
in section I3.1| which is not desirable. This study 
has pointed to the following possible reasons for 
the observed bias on the mixture weights when the 
Gibbs sampler is operated in supervised mode: 

• Predefined PDF templates: when the algo- 
rithm is run in supervised mode, the subpop- 
ulation PDFs are not estimated from the data 
at each iteration, and particles are mapped 
to signal or background based on the PDF 
templates obtained from the high-statistics 
control sample. This can lead to a bias on 
the estimated mixture weights, which may be 
different for different input data sets. This is 
an irreducible source of bias on the supopula- 
tion mixture weights when the Gibbs sampler 
is operated in supervised mode. 

• "Sampling bias": Even when the PDF tem- 
plates obtained from the high-statistics con- 
trol sample provide a good description of the 
signal and background PDFs in the input 
data set, the use of those templates on a 
lower-statistics data set can still lead to a 
bias on the mixture weights. In general, the 
RMS of a histogram obtained sampling from 
a probability distribution in fact increases 
with the number of samples, and it may thus 
be necessary to correct the "high-statistics" 
PDF templates before applying them to a 
lower-statistics input data set in order to 
avoid bias. Future studies will investigate 
possible ways to mitigate this effect. 

In order to make sure that the bias in figure 
[5] is related to information provided to the Gibbs 
sampler based on the high-statistics control sample 
rather than to an intrinsic bias in the algorithm, 
additional runs on toy Monte Carlo samples were 
performed, as described in appendix |Aj In particu- 
lar, figure [6] displays the estimated mixture weights 
obtained running the supervised Gibbs sampler on 
a toy Monte Carlo data set with subpopulation 
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X) 

W 0.3 
0.2 




0.1 



gl 1 1 ! 1 1 ! 1 1 ! 1 
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Iteration number 

Figure 5: Mixture weights obtained running the supervised Gibbs sampler on the input Monte Carlo data 
set. Results from the last 100 iterations are shown. The solid green (dashed red) curve denotes the esti- 
mated fraction of signal (background) particles. The solid green (dashed red) horizontal line indicates the 
true value for signal (background) from the simulation, and the dash-dot line corresponds to the initial 
conditions for the mixture weights. The observed bias is discussed in the text. 



0.! 




0.1 



gl ! ! ! ! ! ! ! ! ! 1 

10 20 30 40 50 60 70 80 90 100 

Iteration number 

Figure 6: Mixture weights obtained running the supervised Gibbs sampler on a toy Monte Carlo data set, 
as described in the text. Results from the last 100 iterations are shown. The solid green (dashed red) 
curve corresponds to the estimated fraction of signal (background) particles. The solid green (dashed red) 
horizontal line indicates the true value for signal (background) from the toy Monte Carlo, and the dash-dot 
line corresponds to the initial conditions for the mixture weights. 
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PDFs kept fixed at truth information. In this case, 
no significant bias is expected, and the figure con- 
firms a better agreement between estimated and 
true mixture weights with respect to the corre- 
sponding results in figure \E\ 

3.2.2 Unsupervised Gibbs sampler 

The unsupervised Gibbs sampler was used in this 
study to estimate the signal and background PDFs 
from the input data set, while keeping the mix- 
ture weights fixed at the results obtained from the 
previous run of the algorithm in supervised mode. 

Figure [7] shows the subpopulation PDFs esti- 
mated by the Gibbs sampler on the Monte Carlo 
input data set. Solid curves correspond to the out- 
put of the histogram regularization procedure av- 
eraged over the last 100 iterations, superimposed 
to the true distributions (histograms). The r] {pt) 
distributions are displayed in the top (bottom) 
plots, figures on the left-hand (right-hand) side 
correponding to background (signal). All distri- 
butions are normalized to unit area. The bottom 
panel in each figure shows the corresponding ra- 
tio between the relevant subpopulation PDF esti- 
mated by the algorithm and truth information. 

In addition to obtaining data-driven estimates 
of the subpopulation PDFs in unsupervised mode, 
one of the goals of the Gibbs sampler in this appli- 
cation is to assign individual particles a probability 
for them to originate from a given process, such as 
a hard scattering of interest as opposed to back- 
ground. In this study, the latter probabilities were 
obtained from the same unsupervised run of the 
algorithm that provided the PDF estimates in fig- 
ure [7J In general, other choices are possible, such 
as performing an additional supervised run of the 
Gibbs sampler with subpopulation PDFs kept fixed 
at the estimates shown in figure 

A first comparison of the classification perfor- 
mance of the algorithm in the configuration cho- 
sen for this study with the corresponding perfor- 



mance of existing techniques is described in section 
13.31 Additional studies will be required for a de- 
tailed assessment of the classification performance 
of the algorithm corresponding to different opera- 
tion modes. 

The probabilities returned by the Gibbs sam- 
pler were validated by comparing the true kine- 
matic distributions with the corresponding ones for 
particles with P s i g > 0.5, P s i g being the probabil- 
ity for a given particle to originate from the signal 
process, averaged over the last 100 iterations of the 
algorithm. Results are shown in figure [8j where 
histogram bars indicate the n distribution for par- 
ticles with P s i g > 0.5, while stars correspond to 
true distributions. 



3.3 Classification performance 

Operating the Gibbs sampler as presented in this 
article is equivalent to using it binary clas- 

sifier. Its performance can thus be quantified us- 
ing the Receiver Operating Characteristic (ROC) 
curve, which displays true-positive as a function of 
false-positive probability. The area under the curve 
is a number between and 1: the higher its value, 
the better the classifier is able to discriminate be- 
tween the two categories (signal and background in 
this case). The ROC curve of a random classifier 
would be a straight line along the main diagonal 
on the true-positive vs false-positive plane ("chance 
diagonal" |13j). 

Figure [9] shows a comparison between the ROC 
curve obtained using the unsupervised Gibbs sam- 
pler on the input collection of particles used for 
this study and the corresponding curves from dif- 
ferent supervised multivariate classification meth- 
ods using TMVA pi] V04-01-00. The curves are 
displayed using an equivalent representation in 
terms of background rejection rate as a function 
of signal efficiencjH . The dashed lines refer to 



Based on our previous terminology, "background rejection rate" is equivalent to 1 — P; 



corresponds to P ai g^ ai g, 
signal subpopulation. 



where Pi 



bkg—>sig 



'bkg^sig, and "signal efficiency" 
j) is the probability for a background (signal) particle to be mapped to the 
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bkg p T sig p T 

Figure 7: Subpopulation PDFs estimated by the Gibbs sampler in unsupervised mode on the input Monte 
Carlo data set used in this study, averaged over the last 100 iterations. Top left: background r/. Top right: 
signal 77. Bottom left: background pt- Bottom right: signal pj<. In each subfigure, the upper panel shows 
truth information (histogram bars) superimposed to the result of the regularization procedure averaged over 
the last 100 iterations (curve). The lower panels display the ratio between subpopulation PDFs estimated 
by the algorithm and the corresponding truth-level information. 



0.3 - 




-6-4-2 □ 2 4 6 

Figure 8: Comparison between r\ and px distributions for particles with P s i g > 0.5 (histogram bars) and 
truth (stars). Additional information is given in the text. 
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Figure 9: Comparison between the ROC curve obtained using the Gibbs sampler in unsupervised mode, 
shown here as background rejection rate as a function of signal efficiency, and the corresponding curves 
obtained using existing supervised classification techniques from TMVA, as described in the text. The solid 
red line is the curve from the unsupervised Gibbs sampler corresponding to the last 100 iterations. The 
other curves correspond to TMVA algorithms, namely Boosted Decision Trees (dashed blue), Naive Bayes 
classification (dashed black), the Neural Network-based classifier MLPBNN (dashed green), and Linear 
Discriminant (dashed red). 



L(cm 1 ) 


(npu) 


N bkg /N sig 


10 33 


2.3 


0.1 


10 34 


23 


0.9 


10 35 


230 


4.4 



Table 1: Average numbers of pile- up particles (second column) expected at different LHC instantaneous 
luminosities (first column) |15j . A 25 ns bunch crossing is assumed. The third column reports the corre- 
sponding ratios between the number of background and signal particles observed in the kinematic region 
considered in this study. 
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Figure 10: Background contamination as a function of signal efficiency at different LHC instantaneous 
luminosities (10 33 cm~ 2 s solid black, 10 34 cm _2 s~ 1 dashed blue, 10 35 cm _2 s _1 dotted green). Background 
contamination is denned as the number of misclassified background particles normalized to the number of 
signal particles. Misclassification probabilities correspond to the ROC curve from the Gibbs sampler in 
figure [9l A 25 ns bunch crossing is assumed |15| . 



different TMVA method^, namely Boosted Deci- 
sion Trees (dashed blue), Naive Bayes classification 
(dashed black), the Neural Network-based classi- 
fier MLPBNN (dashed green) , and Linear Discrim- 
inant (dashed red). 

The solid red line corresponds to the Gibbs 
sampler. The figure suggests that classification 
performance of the Gibbs sampler is similar to that 
of existing supervised methods. 

It may also be useful to provide a more pre- 
cise idea of the background rejections and signal 
efficiencies that can be achieved using the Gibbs 
sampler corresponding to different LHC instanta- 
neous luminositie£|. Figure \W\ shows estimates of 
background contamination as a function of signal 
efficiency at three different LHC instantaneous lu- 
minosities. Background contamination is defined 
as the number of misclassified background particles 
normalized to the number of signal particles, and 
is calculated by rescaling the abscissa of the ROC 



curve by the ratio between the number of back- 
ground and signal particles in the kinematic region 
considered in this study, as given in table □i The 
three curves correspond to instantaneous luminosi- 



10 34 cm 2 s~ 



ties of 10 33 cm 2 s 1 (solid black), 
(dashed blue), and 10 35 cm _2 s _1 (dotted green). 



3.4 Additional remarks 

The possibility to be operated in unsupervised 
mode distinguishes the Gibbs sampler from exist- 
ing multivariate approaches such as those available 
in ROOT [16] with TMVA. However, the iterative 
nature of the algorithm leads to a disadvantage 
with respect to established multivariate algorithms 
in terms of execution time. The running time of the 
Gibbs sampler corresponding to 1,000 iterations on 
the Monte Carlo input data set used in this study 
was about 20 s on a 2GHz Intel Processor with 
1GB RAM. 



3 The algorithms were run using the high-statistics control sample for training and the same collection of particles the 
Gibbs sampler was run on for testing. 

4 The expected average number of pile-up interactions, i.e. the expected average number of primary vertices in the events 
is here taken as a measure of background activity for illustrative purposes. 

The average numbers of pile-up interactions at different LHC instantaneous luminosities are taken from |15| . and corre- 
spond to a 25 ns bunch crossing. 
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However, given the parallelization potential of 
the Gibbs sampler that was pointed out already in 
the original paper [2J, improvements may be pos- 
sible in this respect, for example using commodity 
Graphics Processing Units (GPUs) that are exten- 
sively used both in particle physics and in other 
disciplines for compute-intensive applications. 

4 Conclusions and outlook 

This contribution has presented a feasibility study 
for a novel approach to background discrimination 
in particle physics that builds on a population- 
based view of events from high-energy particle col- 
lisions. Collections of particles are here treated as 
mixtures of subpopulations associated with differ- 
ent physics processes, and numerical methods well 
established in statistics for mixture model decom- 
position are used to assign individual particles a 
probability for them to originate from a hard scat- 
tering of interest as opposed to background. This 
application of the Gibbs sampler to a classification 
problem at the particle level allows the develop- 
ment of a flexible suite of tools to estimate back- 
ground properties directly from the data, e.g. to 
obtain PDF shapes without relying on templates 
obtained from high-statistics control samples and 
without assuming predefined functional forms. 

This study has highlighted strength and lim- 
itations of the algorithm operated in supervised 
and unsupervised mode, and a possible combina- 
tion has been proposed to reduce bias. In general, 
systematic uncertainties associated with the use of 
the Gibbs sampler will have to be evaluated in the 
context of a given analysis. 

Understanding in detail how the classification 
performance of the algorithm in different configu- 
rations compares to existing techniques will require 
further study, as will the possible development of 
subsequent versions optimized in terms of execu- 
tion time building on the inherent parallelizability 
of the Gibbs sampler. 

As anticipated, the total number of particles in 
the Monte Carlo data set used in this study is in 



line with typical charged particle multiplicities at 
the LHC corresponding to operating conditions as 
of July 2011. For this reason, the results presented 
in this article are a promising starting point for 
futher development, with a view to building ded- 
icated software tools based on the Gibbs sampler 
for intensive offline analysis of individual interest- 
ing events at the LHC. 
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A Appendix: Toy Monte Carlo 
studies 

Results from the Monte Carlo study described in 
section [3] were crosschecked on toy Monte Carlo 
data sets. Samples of ~ 600 signal and back- 
ground particles were generated according to r] and 
Pt distributions similar to those obtained from 
Pythia. Particle rj and pt were generated inde- 
pendently: Gaussian PDFs centered at zero with 
standard deviations comparable to those observed 
in the Monte Carlo were used for rj, and px values 
were generated based on polynomial PDFs in the 
range 2 GeV/c < pr < 5 GeV/c parametrizing the 
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corresponding Monte Carlo distributions. 

Additional crosschecks were performed by vary- 
ing both toy Monte Carlo generation parameters 
and seeds, in order to make sure that results did 
not depend on a specific parameter choice. The 
algorithm was also run on different numbers of 
particles in the input data set, with no significant 
changes to the results. 

As mentioned when discussing the bias on mix- 
ture weights in section [3.2. 1| toy Monte Carlo data 
sets made it possible to rule out an intrinsic bias 
in the supervised Gibbs sampler as far as mixture 
weights are concerned. The following remarks can 
be made with reference to the results of the su- 
pervised Gibbs sampler on the toy Monte Carlo 
samples: 

• Particle r\ and px are statistically indepen- 
dent by construction, so that the correlation 
between the two variables in the data set 
used for the Monte Carlo study, which is any- 
way expected to have a small effect, cannot 
contribute to a bias on mixture weights on 
the toy Monte Carlo samples. 

• The above-mentioned "sampling bias" due 
to subpopulation PDFs from high-statistics 
samples being "broader" than the corre- 
sponding lower-statistics distributions is not 
relevant when the PDFs used with the super- 
vised Gibbs sampler come from truth infor- 
mation. 

The above-mentioned figure [6] shows typical es- 
timates of mixture weights obtained running the 
supervised Gibbs sampler on a toy Monte Carlo 
data set using PDFs from truth information. Re- 
sults from the last 100 iterations are shown. Solid 
green and dashed red curves correspond to esti- 
mated fractions of signal and background particles, 
respectively. The solid green (dashed red) horizon- 
tal line indicates the true value for signal (back- 
ground) from the toy Monte Carlo, and the dash- 
dot line corresponds to the initial conditions for 
the mixture weights. The agreement with the true 



weights is significantly better than the correspond- 
ing results obtained running the Gibbs sampler on 
the Monte Carlo data set. 
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